Gene Signatures - How to score & interpret¶
author: "Marc Elosua Bayes"
date: "09/01/2023"
Introduction¶
Gene signatures are commonly used in routine single cell analysis. Many methods exists but they are not all created equally. In this tutorial we are going to go follow a recent benchmarking paper @badia-i-mompel2022 and follow their guidelines on best practices when scoring gene signatures!
With this tutorial we hope to familiarize you with the concepts of gene signatures, how they are scored in single cell datasets and how to interpret the scores obtained!
Before we start here are some key concepts that will help us and frame the vignette!
What is a gene signature?
A "gene signature" can be stated as a single or a group of genes in a cell having a unique pattern of gene expression that is the consequence of either changed biological process or altered pathogenic medical terms @mallik2018.
What is a cell type signature?
A cell type signature is a gene signature representing a group of genes underlying the biological processes characteristic of a cell type.
How do we score them in our dataset?
Scoring a gene signature means to obtain a value for that signature for each cell in our datasets that represents how active the gene program is in each cell. There are many ways to score gene signatures as shown in the
decoupleRpaper @badia-i-mompel2022. However, they do not all perform the same and it is important to select a robust method. The suggested method after their benchmarking analysis is running a Univariate Linear Model (ULM) where the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory one (don't worry, we'll go through this in more detail in a second). The obtained t-value from the fitted model is the activity score of that gene signature in that cell.How do we interpret that score?
Scoring gene signatures using Univariate Linear Models and using the resulting t-value as the scoring metric allows us to simultaneously interpret in a single statistic the direction of activity (either + or -) and its significance (the magnitude of the score).
Can we interrogate the scores obtained?
Yes! In fact it is very important to look past the score obtained by a cell and into which are the genes driving that score. Sometime with gene signatures containing 50 genes it could be that just a few genes are contributing to the signature. If we just stopped at the score we could be mislead into thinking that all of the genes making up the signature are important when it is actually only a fraction of them. Moreover, heterogeneous gene expression between two populations can also lead to 2 cells or populations having similar scores but vastly different genes gene programs underlying them.
Loading packages¶
import decoupler as dc
# Only needed for visualization:
from plotnine import *
import scanpy as sc
import random
import pandas as pd
Some settings to use throughout the document
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
# Plotting options, change to your liking
sc.settings.set_figure_params(dpi=300, facecolor='white', frameon=False)
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))
# Set seed for reproducibility
random.seed(10)
scanpy==1.9.4 anndata==0.9.2 umap==0.5.3 numpy==1.24.3 scipy==1.11.1 pandas==2.0.3 scikit-learn==1.3.0 statsmodels==0.14.0 pynndescent==0.5.10
Load Data¶
adata = sc.datasets.pbmc3k_processed()
try downloading from url https://raw.githubusercontent.com/chanzuckerberg/cellxgene/main/example-dataset/pbmc3k.h5ad ... this may take a while but only happens once
100%|██████████| 23.5M/23.5M [00:00<00:00, 36.9MB/s]
We can visualize the different cell types in it:
sc.pl.umap(adata, color='louvain')
/Users/marc/anaconda3/envs/gs/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:391: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
Gene Signature Scoring¶
Here we define some gene signatures based on prior knowledge. We are setting gene signature that are characteristic for specific cell types to score for their activities in our dataset.
bcell = ["MS4A1", "CD79A", "CD79B", "BANK1", "HLA-DQB1", "HLA-DQA1"]
tcell = ["CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B"]
tnaive = tcell + ["IL7R", "CCR7", "TCF7", "LEF1", "SELL"]
cd8cyto = tcell + ["GZMA", "GZMK", "NKG7", "CCL5"]
mono = ["FCGR3A", "CD14", "S100A9", "S100A8", "MS4A7"]
nks = ["NCR1", "NCR2", "NCR3", "FCGR3A", "GZMA", "GZMK", "NKG7", "CCL5"]
| signature | gene | mor | |
|---|---|---|---|
| 0 | B cells | MS4A1 | 1 |
| 1 | B cells | CD79A | 1 |
| 2 | B cells | CD79B | 1 |
| 3 | B cells | BANK1 | 1 |
| 4 | B cells | HLA-DQB1 | 1 |
| 5 | B cells | HLA-DQA1 | 1 |
| 6 | T cells | CD3D | 1 |
| 7 | T cells | CD3E | 1 |
| 8 | T cells | TRAC | 1 |
| 9 | T cells | TRBC1 | 1 |
| 10 | T cells | TRBC2 | 1 |
| 11 | T cells | CD4 | 1 |
| 12 | T cells | CD8A | 1 |
| 13 | T cells | CD8B | 1 |
| 14 | Naive T cells | CD3D | 1 |
| 15 | Naive T cells | CD3E | 1 |
| 16 | Naive T cells | TRAC | 1 |
| 17 | Naive T cells | TRBC1 | 1 |
| 18 | Naive T cells | TRBC2 | 1 |
| 19 | Naive T cells | CD4 | 1 |
| 20 | Naive T cells | CD8A | 1 |
| 21 | Naive T cells | CD8B | 1 |
| 22 | Naive T cells | IL7R | 1 |
| 23 | Naive T cells | CCR7 | 1 |
| 24 | Naive T cells | TCF7 | 1 |
| 25 | Naive T cells | LEF1 | 1 |
| 26 | Naive T cells | SELL | 1 |
| 27 | CD8 Cytotoxic | CD3D | 1 |
| 28 | CD8 Cytotoxic | CD3E | 1 |
| 29 | CD8 Cytotoxic | TRAC | 1 |
| 30 | CD8 Cytotoxic | TRBC1 | 1 |
| 31 | CD8 Cytotoxic | TRBC2 | 1 |
| 32 | CD8 Cytotoxic | CD4 | 1 |
| 33 | CD8 Cytotoxic | CD8A | 1 |
| 34 | CD8 Cytotoxic | CD8B | 1 |
| 35 | CD8 Cytotoxic | GZMA | 1 |
| 36 | CD8 Cytotoxic | GZMK | 1 |
| 37 | CD8 Cytotoxic | NKG7 | 1 |
| 38 | CD8 Cytotoxic | CCL5 | 1 |
| 39 | Monocytes | FCGR3A | 1 |
| 40 | Monocytes | CD14 | 1 |
| 41 | Monocytes | S100A9 | 1 |
| 42 | Monocytes | S100A8 | 1 |
| 43 | Monocytes | MS4A7 | 1 |
| 44 | NKs | NCR1 | 1 |
| 45 | NKs | NCR2 | 1 |
| 46 | NKs | NCR3 | 1 |
| 47 | NKs | FCGR3A | 1 |
| 48 | NKs | GZMA | 1 |
| 49 | NKs | GZMK | 1 |
| 50 | NKs | NKG7 | 1 |
| 51 | NKs | CCL5 | 1 |
We can see how there are some genes that are specific for each signature but others are shared between them. This is important to take into account when computing the gene signatures and interpreting their scores.
To help us compute these gene signatures we are going to use the R package decoupleR from Bioconductor. decoupleR is a great for carrying out these analysis since it is a framework that contains different statistical methods to compute these scores. Ultimately we will obtain a score for each signature for each cell.
decoupleR requires the gene signatures to be passed as a dataframe so we are going to convert our gene signature vectors into a unified dataframe. mor stands for Mode Of Regulation, at the moment since we don't have a score of how important that gene is for that signature we are going to weight them all equally with a value of 1.
sig_dict = {
"B cells": bcell,
"T cells": tcell,
"Naive T cells": tnaive,
"CD8 Cytotoxic": cd8cyto,
"Monocytes": mono,
"NKs": nks
}
sig_dict
sig_df = pd.concat([pd.DataFrame({'signature': key, 'gene': value, 'mor': 1}) for key, value in sig_dict.items()], ignore_index=True)
sig_df
ULM¶
"Univariate Linear Model (ULM) fits a linear model for each sample and regulator, where the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory one. Target features with no associated weight are set to zero. The obtained t-value from the fitted model is the activity ulm of a given regulator."
{width="584"}
Moreover, a nice thing about ulm is that in a single statistic it provides the direction of activity (either + or -) and its significance (the magnitude of the score). Making the scores very easy to interpret!
So lets compute the signature scores for every cell in our dataset!
dc.run_ulm(
mat=adata,
net=sig_df,
source='signature',
target='gene',
weight='mor',
verbose=True
)
1 features of mat are empty, they will be removed. Running ulm on mat with 2638 samples and 13713 targets for 6 sources.
100%|██████████| 1/1 [00:00<00:00, 12.35it/s]
The obtained scores (ulm_estimate) and p-values (ulm_pvals) are stored in the .obsm key. We can see how every cells has a score for every signature!
adata.obsm['ulm_estimate']
# Looking at all the scores for one specific cell
adata.obsm['ulm_estimate'].loc['AAACATACAACCAC-1']
# Check if the results are concordant between Python and R
print(adata.obsm['ulm_estimate'].loc['TTTGCATGAGAGGC-1'])
print(adata.obsm['ulm_estimate'].loc['AAGATGGAGATAAG-1'])
B cells 10.568047 CD8 Cytotoxic -0.487455 Monocytes -0.363227 NKs -0.429761 Naive T cells -0.513746 T cells -0.363279 Name: TTTGCATGAGAGGC-1, dtype: float32 B cells 0.617235 CD8 Cytotoxic -0.126399 Monocytes 7.572541 NKs 0.049956 Naive T cells 0.406961 T cells -0.571647 Name: AAGATGGAGATAAG-1, dtype: float32
We can see how the results are maintain the same direction but are slightly different in their magnitude. This can be driven by multiple factors, 1) the normalization is slightly different between Scanpy and Seurat see more details here. Moreover, there are slightly different implementations to compute linear models between R and Python. While R keeps all the features (genes) Python discards those that are all 0s thus decreasing the degrees of freedom.
How does a univariate linear model work?¶
Lets start with a toy example. Imagine a very simple scenario where we have two very simple vectors where one is double the other. We can compute the linear model and also easily visualize the relationship between both vectors:
from scipy.stats import linregress # Import linregress from scipy.stats
# Define data as a DataFrame
df = pd.DataFrame({'vec1': [1, 2, 5], 'vec2': [2.1, 3.8, 9.7]})
# Perform linear regression
slope, intercept, r_value, p_value, std_err = linregress(df['vec1'], df['vec2'])
form = f'y = {slope:.2f}x + {intercept:.2f}'
print(form)
# Create the scatter plot with regression line and equation
p = (
ggplot(df, aes(x='vec1', y='vec2')) +
geom_point() +
geom_abline(intercept=0, slope=1, color="red", linetype="dashed") +
geom_smooth(method='lm', se=False, color='blue', formula='y ~ x') +
coord_fixed() +
xlim(0, 10) +
ylim(0, 10) +
theme_minimal()
)
print(p)
y = 1.92x + 0.09
In the example above we see the linear relationship between both vectors and we get the slope and the T value:\
- The slope indicates the what is the change in the response variable (vec2) given a 1 unit change in the predictor variable (vec1).
- The T statistic is the result of a T test. The T test assesses the significance of individual coefficients in our model. The T value indicates the number of standard errors the estimated coefficient is away from the null hypothesis (t = 0). Remember the T value is the $\frac{coefficient}{standard~error}$.
For a real world example please visit the Quarto R document here where all the calculations are carried out!
Visualization¶
To visualize the obtained scores, we can re-use many of scanpy’s plotting functions. First though, we need to extract them from the adata object.
acts = dc.get_acts(adata, obsm_key='ulm_estimate')
acts
AnnData object with n_obs × n_vars = 2638 × 6
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr', 'ulm_estimate', 'ulm_pvals'
dc.get_acts returns a new AnnData object which holds the obtained activities in its .X attribute, allowing us to re-use many scanpy functions, for example:
sc.pl.umap(acts, color=acts.var_names, cmap='RdBu_r', vcenter=0)
sc.pl.violin(acts, keys=acts.var_names, groupby='louvain', rotation=90)
Heatmap by groups¶
We can also visualize the gene signature scores for each cell type using a heatmap
sc.pl.heatmap(acts, acts.var_names, groupby='louvain', swap_axes=True, figsize=(12, 8))
sc.pl.matrixplot(acts, acts.var_names, 'louvain', dendrogram=True, standard_scale='var',
colorbar_title='Z-scaled scores', cmap='RdBu_r', figsize=(12, 8))
import session_info
session_info.show(html=False, dependencies=True)
Python version: 3.10.12 (main, Jul 5 2023, 15:02:25) [Clang 14.0.6 ]
Operating System: macOS-13.4-arm64-arm-64bit
System Architecture: ('64bit', '')
Python Info: sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
Package Versions:
Babel: 2.12.1
Bottleneck: 1.3.5
Jinja2: 3.1.2
MarkupSafe: 2.1.3
Pillow: 10.0.0
PyYAML: 6.0.1
Pygments: 2.16.1
Send2Trash: 1.8.2
anndata: 0.9.2
anyio: 4.0.0
appnope: 0.1.3
argon2-cffi: 23.1.0
argon2-cffi-bindings: 21.2.0
arrow: 1.2.3
asttokens: 2.2.1
async-lru: 2.0.4
attrs: 23.1.0
backcall: 0.2.0
beautifulsoup4: 4.12.2
bleach: 6.0.0
certifi: 2023.7.22
cffi: 1.15.1
charset-normalizer: 3.2.0
colorama: 0.4.6
comm: 0.1.4
contourpy: 1.0.5
cycler: 0.11.0
debugpy: 1.6.7.post1
decorator: 5.1.1
decoupler: 1.5.0
defusedxml: 0.7.1
exceptiongroup: 1.1.3
executing: 1.2.0
fastjsonschema: 2.18.0
fonttools: 4.25.0
fqdn: 1.5.1
h5py: 3.9.0
idna: 3.4
importlib-metadata: 6.8.0
ipykernel: 6.25.1
ipython: 8.15.0
isoduration: 20.11.0
jedi: 0.19.0
joblib: 1.3.2
json5: 0.9.14
jsonpointer: 2.4
jsonschema: 4.19.0
jsonschema-specifications: 2023.7.1
jupyter-client: 8.3.1
jupyter-core: 5.3.1
jupyter-events: 0.7.0
jupyter-lsp: 2.2.0
jupyter-server: 2.7.3
jupyter-server-terminals: 0.4.4
jupyterlab: 4.0.5
jupyterlab-pygments: 0.2.2
jupyterlab-server: 2.24.0
kiwisolver: 1.4.4
llvmlite: 0.40.0
matplotlib: 3.7.1
matplotlib-inline: 0.1.6
mistune: 3.0.1
mizani: 0.9.3
munkres: 1.1.4
natsort: 8.4.0
nbclient: 0.8.0
nbconvert: 7.8.0
nbformat: 5.9.2
nest-asyncio: 1.5.7
networkx: 3.1
notebook-shim: 0.2.3
numba: 0.57.0
numexpr: 2.8.4
numpy: 1.24.3
overrides: 7.4.0
packaging: 23.1
pandas: 2.0.3
pandocfilters: 1.5.0
parso: 0.8.3
patsy: 0.5.3
pexpect: 4.8.0
pickleshare: 0.7.5
pip: 23.2.1
platformdirs: 3.10.0
plotnine: 0.12.2
prometheus-client: 0.17.1
prompt-toolkit: 3.0.39
psutil: 5.9.5
ptyprocess: 0.7.0
pure-eval: 0.2.2
pycparser: 2.21
pynndescent: 0.5.10
pyparsing: 3.0.9
python-dateutil: 2.8.2
python-json-logger: 2.0.7
pytz: 2023.3
pyzmq: 25.1.1
referencing: 0.30.2
requests: 2.31.0
rfc3339-validator: 0.1.4
rfc3986-validator: 0.1.1
rpds-py: 0.10.0
scanpy: 1.9.4
scikit-learn: 1.3.0
scipy: 1.11.1
seaborn: 0.12.2
session-info: 1.0.0
setuptools: 68.0.0
six: 1.16.0
sniffio: 1.3.0
soupsieve: 2.4.1
stack-data: 0.6.2
statsmodels: 0.14.0
stdlib-list: 0.8.0
terminado: 0.17.1
threadpoolctl: 3.2.0
tinycss2: 1.2.1
tomli: 2.0.1
tornado: 6.3.3
tqdm: 4.66.1
traitlets: 5.9.0
typing-extensions: 4.7.1
tzdata: 2023.3
umap-learn: 0.5.3
uri-template: 1.3.0
urllib3: 2.0.4
wcwidth: 0.2.6
webcolors: 1.13
webencodings: 0.5.1
websocket-client: 1.6.2
wheel: 0.38.4
zipp: 3.16.2